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Abstract 

We study a regression model with a huge number of interacting 
variables. We consider a specific approximation of the regression func- 
tion under two assumptions: (i) there exists a sparse representation of 
the regression function in a suggested basis, (ii) there are no interac- 
tions outside of the set of the corresponding main effects. We suggest 
an hierarchical randomized search procedure for selection of variables 
and of their interactions. We show that given an initial estimator, an 
estimator with a similar prediction loss but with a smaller number of 
non-zero coordinates can be found. 



1 Introduction 

Suppose that we observe (l^,Xj), i = 1, . . . ,n, an i.i.d. sample from the 
joint distribution of {Y,X.), where Y ^ TZ, and X = {Xi, . . . ,Xd) ^ Xi x 
■ ■ ■ X = X, with Xj being some subsets of finite-dimensional Euclidean 
spaces. Our purpose is to estimate the regression function /(X) = E{Y\X.) 
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nonparametrically by constructing a suitable parametric approximation of 
this function, with data-dependent values of the parameters. We consider 
the situation where n is large, or even very large and the dimension d is also 
large. Without any assumptions, the problem is cursed by its dimensionality 
even when Xj = TZ for all j. For example, a histogram approximation has 
p = 3^'^ > 10^ parameters when the number of variables is d = 20, and the 
range of each is divided into the meager number of three histogram bins. 

It is common now to consider models where the number of parameters p is 
much larger than the sample size n. The idea is that the effective dimension 
is defined not by the number of potential parameters p but by the (unknown) 
number of non-zero parameters that can be much smaller than n. Methods 



like thresholding in white noise model, cf. Abramovich, Benjamini, Donoho and Johnstone (2006) 



or Golubev (2002) , LASSO, LARS or Dantzig selector in regression, cf, 


Tibshirani (1996) 


Chen, Donoho and Saunders (2001) , Efron, Hastie, Johnstone and Tibshirani (2004 


Candes and Tao (2007) 


are used, and it is proved that if the vector of esti- 



mated parameters is sparse (i.e., the number of non-zero parameters is rel- 
atively small) then the model can be estimated with reasonable accuracy, cf. 



Bunea, Tsybakov and Wegkamp (2007a) [ Bunea, Tsybakov and Wegkamp (2007b) 


Candes and Tao (2007) 


Fu and Knight (2000) 


Greenshtein and Ritov (2004) 




Meinshausen and Biihlmann (2006) 


Meinshausen and Yu (2006) 


Zhang and Huang (2006) 


Zhao and Yu (2006) A direct selection of a small number of non- 


zero vari- 





ables is relatively simple for the white noise model. There, each variable 
is processed separately, and the parameters can be ordered according to 
the likelihood that they are non-zero. The situation is more complicated in 
regression problems. Methods like LASSO and LARS yield numerically effi- 



cient ways to construct a sparse model, cf. 


Juditsky and Nemirovski (2000) 




Nemirovski (2000) 


Osborne, Presnell and Turlach (2000b) 


Osborne, Presnell and Turlach (2000a) 


Efron, Hastie, Johnstone and Tibshirani (2004) 


Turlach (2005) 


However, 



they have their limits, and are not numerically feasible with too many pa- 
rameters, as for instance in the simple example considered above. 

Our aim is to propose a procedure that can work efficiently in such 
situations. We now outline its general scheme. Consider a collection of 
functions (V'jj)i=i,...,cZ, j=o,i,...,L where ipij : Afj — > 7^. For example, for fixed 
i this can be a part of a basis {ipij)j=o,i,... for L2{Xi). For simplicity, we 
take the same number L of basis functions for each variable. We assume 
that ipifl = 1. Consider an approximation of regression function / given 
by: 

d 

je{0,l,...,L}'' i=l 
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where j = {ji,---,jd) and /3j are unknown coefficients. Note tliat fp is 
nothing but a specific model with interactions between variables, such that 
all the interactions are expressed by products of functions of a single variable. 
In fact, since ipi^ = 1, the multi-indices j with only one non-zero coefficient 
yield all the functions of a single variable, those with only two non-zero 
coefficients yield all the products of two such functions, etc. Clearly, this 
covers the above histogram example, wavelet approximations and others. 

The number of coefficients /3j in the model is (L + 1)'^. The LASSO type 
estimator can deal with a large number of potential coefficients which grows 
exponentially in n. So, theoretically, we could throw all the factors into 
the LASSO algorithm and find a solution. But p ^ L'^ \s typically a huge 
number. Although in the theory LASSO can handle that many variables, in 
practice, it becomes numerically infeasible. Therefore, a systematic search 
is needed. 

Since there is no way to know in advance which factors arc significant, 
we suggest a hierarchical selection: we build the model in a tree fashion. At 
each step of the iteration we apply a LASSO type algorithm to a collection 
of candidate functions, where we start with all functions of a single variable. 
Then, from the model selected by this algorithm we extract a sub-model 
which includes only K functions, for some predefined K. The next step of 
the iteration starts with the same candidate functions as its predecessor plus 
all the interactions between the K functions selected at the previous step. 

Formally we consider the following hierarchical model selection method. 
For a set of functions !F with cardinality IJ'^I > K, let M.Sk be some pro- 
cedure to select K functions out of We denote by MSk {^) the selected 
subset of \MSk{J^)\ = K. Also, for a function / : A" ^ 7^, let N(/) be 
the minimal set of indices such that / is a function of (^j)jeN(/) only. The 
procedure is defined as follows. 

(i) Set J^o = utl{V^,l,■■■,V'^,L}• 
(ii) For m = 1,2, ... , let 

J'm = Tra-x U{fg: f,ge MSK{:F^-i),m)f^H9) = 0}- 

(iii) Continue until convergence is declared. The output of the algorithm 
is the set of functions MSxi^'^m) for some m. 

This search procedure is valid under the dictum of no interaction outside 
of the set of the corresponding main effects: a term is included only if it is 
a function of one variable or it is a product of two other included terms. If 
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this is not a valid assumption one can enrich the search at each step to cover 
ah the coefficients /3j of the model. However, this would be cumbersome. 

Note that \Tra\ < + \ J^m-i\ < rnK^ + |^o| = mK'^ + Ld. Thus, 
the set J^m is not excessively large. At every step of the procedure we keep 
for selection all the functions of a single variable, along with not too many 
interaction terms. In other words, functions of a single variable are treated 
as privileged contributors. On the contrary, interactions are considered with 
a suspicion increasing as their multiplicity grows: they cannot be candidates 
for inclusion unless their "ancestors" were included at all the previous steps. 

The final number of selected effects is K by construction. We should 
choose K to be much smaller than n if we want to fit our final model in the 
framework of the classical regression theory. 

One can split the sample in two parts and do model selection and es- 
timation separately. Theoretically, the rate of convergence of the LASSO 
type procedures suffers very little when the procedures are applied only to a 
sub-sample of the observations, as long as the sub-sample size ums used for 
model selection is such that UMs/n converges slowly to 0. We can therefore, 
first use a sub-sample of size ums to select, according to (i)-(iii), a set of K 
terms that we include in the model. The second stage will use the rest of the 
sample and estimate via, e.g., standard least-square method the regression 
coefficients of the K selected terms. 

This paper has two goals. The first one, as described already, is suggest- 
ing a method to build highly complex models in a hierarchial fashion. The 
second purpose is arguing that a reasonable way to do model selection is 
a two stage procedure. The first stage can be based on the LASSO, which 
is an efficient way to obtain sparse representation of a regression model. 
We argue, however, by a way of example in Section [2l that using solely the 
LASSO can be an non-optimal procedure for model selection. Therefore, 
in Section [3] we introduce the second stage of selection, such that a model 
of a desired size is obtained at the end. At this stage we suggest to use 
either randomized methods or the standard backward procedure. We prove 
prediction error bounds for two randomized methods of pruning the result 
of the LASSO stage. Finally, in Section H] we consider two examples that 
combine the ideas presented in this paper. 

2 Model selection: an example 

The above hierarchical method depends on a model selection procedure 
MSk that we need to determine. For high-dimensional case that we are 
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dealing with, LASSO is known to be an efficient model selection tool: it 
is shown that under general conditions the set of non-zero coefficients of 
LASSO estimator coincides with the true set of non-zero coefficients in 
linear regression, with probability converging to 1 as re — > oo (see, e.g., 



Meinshausen and Biihlmann (2006) ; Zhao and Yu (2006) ). However, these 



results depend on strong assumptions that essentially role off anything close 
to multicolinearity. These conditions are often violated in practice when 
there are many variables representing a plentitude of highly related one to 
another demographic and physical measurements of the same subject. They 
are also violated in a common statistical learning setup where the variables 
of the analysis are values of different functions of one real variable (e.g., 
different step functions). Note that for our procedure we do not need to re- 
tain all the non-zero coefficients but just to extract the K "most important" 
ones. To achieve this, we first tried to tune the LASSO in some natural way. 
However, this approach failed. 

We start with an example. We use this example to argue that although 
the LASSO does select a small model (i.e., typically many of the coordinates 
of the LASSO estimator are 0), it does a poor job in selecting the relevant 
variables. A naive approach for model selection when the constraint applies 
to the number of non-zero coefficients, is to relax the LASSO algorithm until 
it yields a solution with the right number of variables. We believe that this 
is a wrong approach. The LASSO is geared for Li constraints and not for 
Lq ones. We suggest another procedure in which we run the LASSO until 
it yields a model more complex than wished, but not too complex, so that 
a standard model selection technique like backward selection can be used. 



This was the method considered in Greenshtein and Ritov (2004) to argue 



that there are model selection methods which are persistent under general 
conditions. 

We first recall the basic definition of LASSO. Consider the linear regres- 
sion model 

y = Z/3o + e 

where y = {Yi, . . . ,Yn)' G TZ^ is the vector of observed responses, Z G 
■j^nxp -g ^YiQ design matrix, /?o € TZ^ is an unknown parameter and e = 
(^1, . . . , ^ri)' € TZ"' is a noise. The LASSO estimator (3^ of /3o is defined as a 
solution of the minimization problem 

min \\y-Z/3f (1) 
where T > is a tuning parameter, is the ^i-norm of (3 and || • || is the 
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empirical norm associated to the sample of size n: 



n 



This is the formulation of the LASSO as given in Tibshirani (1996) Another 
formulation, given below in 
with Li penalty. 



is that of minimization of the sum of squares 
Clearly, ([1]) is equivalent to ([8]) with some constant r 
dependent on T and on the data, by the Lagrange argument. The standard 



LARS-like algorithm of Efron et al. (2004) , which is the algorithm we used 



is based on gradual relaxation of the constraint T of equation ([T]) , and solves 
therefore simultaneously both problems. The focus of this paper is the 
selection of a model of a given size. Hence we apply the LARS algorithm 
until we get for the first time a model of a prescribed size. 



Example 2.1 We consider a linear regression model with 100 i.i.d. observa- 
tions of (y, Zi, . . . , Zi5o) where the predictors {Zi, . . . , Zi^q) are i.i.d. stan- 
dard normal, the response variable is y = X^jE?! Pj^j^^ = Si=i 25+j2 '^i~*~^' 
and the measurement error is ^ ~ N(0,a'^), a = 0.1. 

Note that we have more variables than observations but most of the Pj 
are zero. 

Figure [2 presents the regularization path, i.e. the values of the coeffi- 
cients oi Pl as a function of T in ([1]). The vertical dashed lines indicate the 
values of the T for which the number of non-zero coefficients of Pl is for the 
time larger than the mark value (multiple values of 5). The legend on the 
right gives the value of the 20 coefficients with the highest values (sorted by 
the absolute value of the coefficient). 

Figure [B presents a similar situation. In fact, the only difference is 
that the correlation between any two Zj's is now 0.5. Again, the 10 most 
important variables are those with non-zero true values. 

Suppose we knew in advance that there are exactly 10 non-zero coef- 
ficients. It could be assumed that LASSO can be used, stopped when it 
first finds 10 non-zero coefficients (this corresponds to T « 0.5 in Figure [U 
b). However, if that was the algorithm, then only three coefficients with 
non-zero true value, /^s, Ps, and Piq, were included together with some 7 
unrelated variables. For T ~ 2 the 10 largest coefficients do correspond to 
the 10 relevant variables, but along with them many unrelated variables are 
still selected (8 variables in Figure [B), and moreover this particular choice 
of T cannot be known in advance if we deal with real data. 
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3 Randomized selection 



The approach to design the model selector MSk that we believe should be 
used is the one applied in the examples of Section HI It acts as follows: run 
the LASSO for a large model which is strictly larger than the model we want 
to consider, yet small enough so that standard methods for selecting a good 
subset of the variables can be implemented. Then run one of such methods, 
with given subset size K: in the examples of Section H] we use the standard 
backward selection procedure. We do not have a mathematical proof which 
is directly relevant to such a method. We can prove, however, the validity 
of an inferior backward method which is based on random selection (with 
appropriate weights) of the variable to be dropped at each stage. We bound 
the increase in the sum of squares of the randomized method. The same 
bounds are applied necessarily to the standard backward selection. 

Suppose that we have an arbitrary estimator j3 with values in TZ^, not 
necessarily the LASSO estimator. We may think, for example, of any esti- 
mator of parameter Pq in the linear model of Section [2l but our argument is 
not restricted to that case. We now propose a randomized estimator /3 such 
that: 

(A) the prediction risk of /3 is on the average not too far from that of /3, 

(B) P has at most K non-zero components, 

(C) large in absolute value components of /? coincide with those of (3. 

Definition of the randomization distribution. Let X be the set of non-zero 
coordinates of the vector (3 = . . . ,(3p). We suppose that its cardinality 
K = \I\ > 2. Introduce the values 

Pi = mm{l, c{K -l)m/m\i}, iel, 

where c > 1 is a solution of J2i£iPi = K — Such c exists since the function 

t ^ Pi{t) = mm{l,t(K mil} 

is continuous and non-decreasing, limj^oo YlieJpii'^) — ^ "^ieiPii^) — 
K — 1. From "^i^jPi = K — 1 we get 

^{l-p^) = l, (2) 
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so that the cohection {1 — pjjjgx defines a probability distribution on T that 
we denote by P*. Note that there exists a pi not equal to 1 (otherwise we 
have ^YLi^xVi = ^\ i^i particular, wc have always -pi < 1 for the index i 
that corresponds to the smallest in absolute value Pi. On the other hand, 
Pi > since /3i 7^ for z G I. Therefore, < < 1 for at least two indices 
i corresponding to the two smallest in absolute values coordinates of p. 

Definition of the randomized selection procedure. Choose i* from I at 
random according to distribution P*: P*(i* = i) = 1 — pj, i G X. We 
suppose that the random variable i* is independent of the data y. Define 
a randomized estimator /3* = . . . , /?*) where = 0, = Pi/pi for 
i ET\{i*}, and P* = for i ^ X. In words, we set to zero one coordinate 
of P chosen at random, and the other coordinates are either increased in 
absolute value or left intact. We will sec that on the average wc do not loose 
much in prediction quality by dropping a single coordinate in this way. 

We then perform the same randomization process taking P* as initial 
estimator and taking randomization independently of the one used on the 
first step. We thus drop one more coordinate, etc. Continuing itcratively 
after K — K steps we are left with the estimator which has exactly the pre- 
scribed number K of non-zero coordinates. We denote this final randomized 
estimator by p. This is the one we are interested in. 

Denote by E* the expectation operator with respect to the overall ran- 
domization measure which is the product of randomization measures over 
the K — K iterations. 

Theorem 3.1 Let Z G q^^p 5g ^ given matrix. Suppose that the diagonal 
elements of the corresponding Gram matrix Z'Z/n are equal to 1, and let P 
he any estimator with > 3 non-zero components. Then the randomized 
estimator P having at most K < K non-zero coordinates has the following 
properties. 

(i) For any vector f G TZ^, 

E* ||f - zpf < ||f - z~pf + wpwl - J—^ . 

(a) Let he the coordinates of P ordered hy ahsolute value: \P{i)\ > 
\P{2)\ > ■■■ > \P{p)\- Suppose that \P(^k)\ > \\P\\i/{K - 1) for some 
k. Then the estimator P coincides with P in the k largest coordinates: 

Aj) = hj)' J = l,•••,^• 
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(in) Suppose that |/3(fc+i)| = and |/3(fc)| > — 1) for some k. Then 

(3 keeps all the non-zero coordinates of (5. 

Proof. It is easy to see that E*(/3*) = /3j for all i and, for any vector f g 7?-", 

E* llf - Z/3*f = llf - Z/3f + - tracefZ'ZS*) 

n 

n 

= ||f-Z/3||2 + -VzJS*z, 
n ^ — ' 



<||f-Z/3f + ^^, 



n-- ^ (3) 

.2 1 - Pj 



1=1 

P 



where Zj are the rows of matrix Z and S* = E*[(/3* — (3){(3* — /?)'] is the 
randomization covariance matrix. We used here that S* is of the form 

S* = diag - {Bi3)iBpy with B = diag (^^) > 

and the diagonal elements of Z'Z/n are equal to 1, by assumption of the 
theorem. 

Recall that c > 1, and therefore > \\(3\\i/{K — 1) implies pj = 1. 
Hence, 

^ 0<|/3,|<||^||i/{A'-l) ^ 

T 1/3,1(1 -p.) 

0<|/3,|<||/3||i/(i^-l) 



i6i 



(K-1)^ 



where we used ([2]). Thus, the randomized estimator /5* with at most K — 1 
non-zero components satisfies 

E*||f-Z/3*||2 < ||f-Z/3||2 + J^M_. (5) 
II II - II Mil ^K-iy ^ ' 
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Note also that (3* has the same £i norm as the initial estimator (5: 

ll/31li = ll/3||i (6) 
In fact, the definition of (3* yields 

iriii-ii/3iii 



in view of [5J 

Using ([5|) and ^ and continuing by induction we get that the final 
randomized estimator (3 satisfies 

E* llf - Z/3|P < llf - Z/3|P + y J'^"^ 

II /^ll -II P\\ A. (^_^.)2 

<l|f-z^f + «(^-^). 

This proves part (i) of the theorem. Part (ii) follows easily from the defi- 
nition of our procedure, since pj = 1 for all the indices j corresponding to 
, . . . , and the ii norm of the estimator is preserved on every step of 
the iterations. The same argument holds for part (iii) of the theorem. □ 

Consider now the linear model of Section [2j Let (3 be an estimator of 
parameter (3q. Using Theorem 13.11 with f = Z/3o we get the following bound 
on the prediction loss of the randomized estimator (3: 

E*||Z(^-/?o)f <||Z(/3-/3o)f + . (7) 

We see that if K is large enough and the norm is bounded, the difference 
between the losses of /? and (3 is on the average not too large. For (3 = (3^ 
we can replace by in ([7]). 

10 



E 


Pj 


1 w\ 

Pi* 


11/31 


li 


Pj<l \ 


{K - 


-1) 


m 


li 


i6X 


{k - 


-1) 



J ex 



I <K- 1) 



c{K - 1) 



0, 



As P we may also consider another LASSO type estimator which is some- 
what different from Pl described in Section [2) 



(3 = arg min { | 



y-Z/3f + r|l/3||i} 



(8) 



where r = Ay^ {log p)/n with some constant A > large enough. As shown 



in Bickel, Ritov and Tsybakov (2007) , for this estimator, as well as for the 
associated Dantzig selector, under general conditions on the design matrix Z 
the ii norm satisfies = \\Po\\i + Op{sy/ {logp)/n) where s is the number 
of non-zero components of Pq. Thus, if Pq is sparse and has a moderate £i 
norm, the bound ([7]) can be rather accurate. 

Furthermore, Theorem 13.11 can be readily applied to nonparametric re- 
gression model 

y = f + £ 

where f = (/(Xi), . . . , /(X„))' and / is an unknown regression function. In 
this case Z/3 = //^(X) is an approximation of /(X), for example as the one 
discussed in the Introduction. Then, taking as P either the LASSO estima- 
tor ([5]) or the associated Dantzig selector we get immediately sparsity oracle 
inequalities for prediction loss of the corresponding randomized estimator 
P that mimic (to within the residual term 0(||/3||f /i^)) those obtained for the 



LASSO in Bunea, Tsybakov and Wegkamp (2007a) Bickel, Ritov and Tsybakov (2007) 



and for the Dantzig selector in Bickel et al. (2007) 



It is interesting to compare our procedure with the randomization device 
usually referred to as the "Maurey argument" . It is implemented as a tool to 



prove approximation results over convex classes of functions Barron (1993) 



Maurey's randomization has been used in statistics in connection to con- 
vex aggregation Nemirovski (2000) , pages 192-193 (iC-concentrated aggre- 
gation), and Bunea, Tsybakov and Wegkamp (2007a) , Lemma B.l. 



The Maurey randomization can be also applied to our setting. Define 
the estimator Pm as follows: 

(i) choose K < K; draw independently at random K coordinates from 2 
with the probability distribution 

(ii) set the jth coordinate of Pm equal to 



Pmj 



ikj/K if/3j>0, 
" ikj/K {i~Pj < 0, 
ifj^X 
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where kj < K is the number of times the jth coordinate is selected at 
step (i). 

Note that, in general, none of the non-zero coordinates of Pm is equal to the 
corresponding coordinate of the initial estimator /3. The prediction risk of 
Pm is on the average not too far from that of /? as the next theorem states. 

Theorem 3.2 Under the assumptions of Theorem \3.1\ the randomized esti- 
mator Pm with at most K < K non-zero coordinates satisfies 

E*||f-Z^Mf< ||f-Z/3f + ME. (9) 

K 

Proof. Let rji,...,rjK be i.i.d. random variables taking values in I with 
the probability distribution {|/3j|/||/3||i}jei- We have kj = Ylf=iHVs = j) 
where /(•) is the indicator function. It is easy to see that E*(/3jv/j) = Pj and 
the randomization covariance matrix S* = E*[(/3m — P){Pm — P)'] has the 
form 



ldmg\~p,\-h~p\\~p\' (10) 



K °" ' K 

where \P\ is the vector of absolute values \Pi\. Acting as in ([3]) and using 
([To]) we get 



1 " 

E* ||f - ZPuf = ||f - Z/3f + - ^z^S 



n 

i=l 



<||f-Z/3f + Mi ^1/3,1 

which yields the result. □ 

The residual term in ([9]) is of the same order of magnitude 0(||/3||f /X) 
as the one that we obtained in Theorem 13.11 In summary, Pm does achieve 
the properties (A) and (B) mentioned at the beginning of this section, but 
not the property (C): it does not preserve the largest coefficients of p. 

Finally, note that applying ([5]) with f = y we get an inequality that links 
the residual sums of squares (RSS) of P* and /3: 

E*||y-Z/?*f < lly-Z^f + J^l^. (11) 
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The left hand side of (jlip is bounded from below by the minimum of the 
RSS over all the vectors (5 with exactly K — \ non-zero entries among the K 
possible positions where the entries of the initial estimator (3 are non-zero. 
Hence, the minimizer j3** of the residual sums of squares ||y — Z/3|p over all 
such /? is an estimator whose RSS does not exceed the right hand side of 
pip . Note that (3** is obtained from (3 by dropping the coordinate which 
has the smallest contribution to B?. Iterating such a procedure K — K times 
we get nothing but a standard backward selection. This is exactly what we 
apply in Section HI However, the estimator obtained by this non-randomized 
procedure has neither of the properties stated in Theorem I3.1b ince we have 
only a control of the RSS but not necessarily of the prediction loss, and 
the ^l norm of the estimators is not preserved from step to step, on the 
difference from our randomized procedure. 

4 Examples 

We consider here two examples of application of our method. The first one 
deals with simulated data. 

Example 4.1 We considered a sample of size 250 from {Y,Xi, . . . ,Xiq), 
where Xi, . . . , Xiq are i.i.d. standard uniform, Y = < Xi < j) + 

/32l(| <X2< < Xs < < ^4 < I) + e, where l(-) denotes 

the indicator function and e is normal with mean and variance such that 
the population is 0.9. The coefficients Pi and /?2 were selected so that 
the standard deviation of the second term was three times that of the first. 

We followed the hierarchical method (i)-(iii) of the Introduction. Our 
initial set !Fq was a collection of L = 32 step functions for each of the ten 
variables {d = 10). The jump points of the step functions were equally 
spaced on the unit interval. The cardinality of J-q was 279 (after taking care 
of multicolinearity) . At each step we run the LASSO path until K = 40 
variables were selected, from which we selected K = 20 variables by the 
standard backward procedure. Then the model was enlarged by including 
interaction terms, and the iterations were continued until there was no in- 
crease in R^. 

The first step (with single effects only) ended with R^ = 0.4678, and 
the correlation of the predicted value of Y with the true one was 0.4885. 
The second iteration (two way interactions) ended with R^ = 0.6303 and 
correlation with the truth of 0.6115. The third (three and four ways inter- 
actions were added) ended with R^ = 0.7166 and correlation of 0.5234 with 
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the truth. The process stopped after the fifth step. The final predictor had 
correlation of 0.5300 with the true predictor. 

The LASSO regularization path for the final (fifth) iteration is presented 
in Figure [2j The list of 20 terms included in the model is given in the legend 
where denotes the the A:th step function of variable i. The operator x 
denotes interaction of variables. We can observe that the first 12 selected 
terms are functions of variables 1 to 4 that are in the true model. Some 
of the 20 terms depend also on two other variables (8 and 10) that do not 
belong to the true model. 

Example 4.2 (The Abalone Data) The abalone data set, taken from 
[ftp : //ftp . ics .uci . edu/pub/machine-learning-databases/abalone/, 

gives the age of abalone (as determined by cutting the shell and counting 
the number of rings) and some physical measurements (sex, length, diame- 
ter, height, whole weight, weight of meat, gut weight, and shell weight after 
being dried). The data was described initially by Nash, et al in 1994. We 
selected at random 3500 data points as a training set. The 677 remaining 
points were left as a test bed for cross-validation. 

We used as a basic function of the univariate variable the ramp function 
(x — a)l(x > a). The range of the variables was initially normalized to 
the unit interval, and we considered all break points a on the grid with 
spacing 1/32. However, after dropping all transformed variables which are 
in the linear span of those already found, we were left with only 17 variables. 
We applied the procedure with LASSO which ends with at most K = 60 
variables, from which at most = 30 were selected by backward regression. 

The first stage of the algorithm ends with = 0.5586 (since we started 
with 17 terms and we were ready to leave up to 30 terms, nothing was 
gained in this stage). The second stage, with all possible main effects and 
two-way interactions, dealt already with 70 variables and finished with only 
slightly higher (0.5968). The algorithm stopped after the fifth iteration. 
This iteration started with 2670 terms, and ended with R^ = 0.5779. The 
correlation of the prediction with the observed age of the test sample was 
0.5051. The result of the last stage is given in Figure [3l It can be seen 
that the term with the largest coefficient is that of the whole weight. Then 
come 3 terms involving the meat weight, and its interaction with the length. 
The shell weight which was most important when no interaction terms were 
allowed, became not important when the interactions were added. 
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Figure 1: 



Selecting variables. Coefficients vs. 



Regression coefficients 




Figure 2: The final path of the LASSO algorithm for the simulation of 
Example 14. 1[ 
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Figure 3: The final path of the LASSO algorithm for the abalone data set. 
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